Alauddin Sabari

Data Preprocessing with pandas.

Handling Missing Data

Let's look at a few convenient methods for dealing with missing data in pandas:

Reasons for Missing Data

  1. The user chose not to provide the data because of privacy concerns
  2. Data was lost during transfer
  3. There was not enough information to fill in a particular column, etc.

A dataset can have missing values for several of these reasons at once.

df.describe()

Here, count reports the number of non-null values in each column.

Because NaN is a floating-point value, an integer column that contains NaN is converted to float type.
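
A minimal sketch of the two points above, using a small made-up DataFrame (the column names `age` and `city` and all values are illustrative, not taken from the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one age and one city are missing.
df = pd.DataFrame({
    "age": [22, 35, np.nan, 58],
    "city": ["Pune", None, "Delhi", "Agra"],
})

# describe() summarises numeric columns; "count" ignores NaN.
summary = df.describe()
print(summary.loc["count", "age"])  # 3.0, not 4

# An integer Series is upcast to float64 once it has to hold NaN.
ints = pd.Series([1, 2, 3])
with_gap = ints.reindex([0, 1, 2, 3])  # label 3 has no value, so it becomes NaN
print(ints.dtype, with_gap.dtype)
```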

df.isna() and df.notna()

df.isnull()

Finding count of missing values in each column
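
The checks named above can be sketched as follows. The DataFrame is a made-up example; only `isna`, `notna`, `isnull`, and `sum` are pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns.
df = pd.DataFrame({
    "age": [22, np.nan, 41, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
})

mask = df.isna()                  # True where a value is missing
present = df.notna()              # element-wise opposite of isna()
same = df.isnull().equals(mask)   # isnull() is an alias of isna()

# True counts as 1, so summing the mask gives missing values per column.
missing_per_column = mask.sum()
print(missing_per_column)
```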

Sometimes information for a column is insufficient to assign a value

Missing data for datetime
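
For datetime columns pandas marks missing entries with NaT ("not a time") rather than NaN. A small sketch with made-up strings, assuming the dates arrive as text and are parsed with `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical input; None and unparseable strings become NaT with errors="coerce".
raw = pd.Series(["2021-01-05", None, "not a date", "2021-03-17"])
dates = pd.to_datetime(raw, errors="coerce")

print(dates)
print(dates.isna().sum())  # isna() detects NaT the same way it detects NaN
```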

DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
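
To make the "Series sharing one index" picture concrete, here is a short sketch that builds a DataFrame from two Series. The names `height` and `weight` and the values are invented for illustration.

```python
import pandas as pd

index = ["a", "b", "c"]
height = pd.Series([1.6, 1.8, 1.7], index=index)
weight = pd.Series([55, 80, 68], index=index)

# Each dictionary key becomes a column; the Series are aligned on their index.
df = pd.DataFrame({"height": height, "weight": weight})
print(df)
print(type(df["height"]))  # pulling a column back out gives a Series
```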

Selection and Indexing

Let's learn the various methods to grab data from a DataFrame.

DataFrame Columns are just Series

Creating a new column:

Removing Columns

Rows can be dropped the same way:
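
A compact sketch of the column and row operations listed above, on a made-up DataFrame (labels `W`, `X`, `Y` and `A`, `B`, `C` are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [1, 2, 3], "X": [4, 5, 6], "Y": [7, 8, 9]},
    index=["A", "B", "C"],
)

col = df["W"]                    # a single column is a Series
df["new"] = df["W"] + df["Y"]    # create a column from existing ones

# drop() returns a new DataFrame; df itself is unchanged
# unless the result is assigned back.
no_new = df.drop("new", axis=1)  # axis=1 drops a column
no_c = df.drop("C", axis=0)      # axis=0 drops a row by label
print(no_new.columns.tolist(), no_c.index.tolist())
```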

Selecting Rows

Or select by position instead of by label:

Selecting subset of rows and columns
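
The row and subset selections above might look like this on a made-up DataFrame. `loc` selects by label and `iloc` by integer position.

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [1, 2, 3], "X": [4, 5, 6], "Y": [7, 8, 9]},
    index=["A", "B", "C"],
)

row_by_label = df.loc["B"]                 # label-based
row_by_position = df.iloc[1]               # position-based; the same row here
cell = df.loc["B", "Y"]                    # a single value
subset = df.loc[["A", "C"], ["W", "Y"]]    # rows A and C, columns W and Y
print(subset)
```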

Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

For two conditions you can use | and & with parentheses:
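
A short sketch of conditional selection with one and two conditions. The data is invented; the point is the boolean-mask pattern.

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [1, -2, 3], "X": [4, 5, -6], "Y": [7, 8, 9]},
    index=["A", "B", "C"],
)

positive_w = df[df["W"] > 0]                  # rows where W is positive

# Parentheses are required: & and | bind tighter than comparisons.
both = df[(df["W"] > 0) & (df["X"] > 0)]      # AND
either = df[(df["W"] < 0) | (df["X"] < 0)]    # OR
print(both.index.tolist(), either.index.tolist())
```

Python's `and` and `or` do not work here because they try to reduce a whole Series to a single truth value, which pandas refuses to do.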

More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!
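
Resetting and setting the index can be sketched as below. The `state` column and its values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"W": [1, 2, 3]}, index=["A", "B", "C"])

# reset_index() moves the old index into a column named "index"
# and installs a default 0..n-1 index. It returns a new DataFrame.
reset = df.reset_index()

# set_index() promotes an existing column to the index.
df["state"] = ["CA", "NY", "WY"]
by_state = df.set_index("state")
print(reset.columns.tolist(), by_state.index.tolist())
```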

Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would use normal bracket notation, df[]. Selecting one value of the outer level returns the sub-DataFrame for that value:
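
A minimal sketch of a two-level index and a few ways to select from it. The level names `group` and `num` and all values are illustrative.

```python
import pandas as pd

# Two-level row index: an outer group and an inner number.
index = pd.MultiIndex.from_tuples(
    [("G1", 1), ("G1", 2), ("G2", 1), ("G2", 2)],
    names=["group", "num"],
)
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]}, index=index)

sub = df.loc["G1"]              # sub-DataFrame indexed by the inner level
row = df.loc["G1"].loc[2]       # chain .loc to reach one row
same_row = df.loc[("G1", 2)]    # or pass the full key as a tuple
cross = df.xs(1, level="num")   # every row whose inner level equals 1
print(sub)
```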

Frequency Histogram

Seaborn documentation: https://seaborn.pydata.org/tutorial/distributions.html

Relative Frequency Histogram Plot

Density Plot
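
The three plot types differ only in how bar heights are normalised. The sketch below computes those heights with NumPy instead of drawing them with seaborn, so the relationship is easy to inspect. The sample and bin edges are made up, and a kernel density plot would smooth the last quantity rather than show it as bars.

```python
import numpy as np

# Hypothetical sample of 10 ages.
ages = np.array([21, 22, 22, 25, 31, 34, 35, 38, 44, 49])
bins = [20, 30, 40, 50]

# Frequency: raw count of observations in each bin.
counts, edges = np.histogram(ages, bins=bins)

# Relative frequency: counts divided by the sample size (heights sum to 1).
relative = counts / counts.sum()

# Density: relative frequency divided by bin width (bar areas sum to 1).
density, _ = np.histogram(ages, bins=bins, density=True)
print(counts, relative, density)
```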

More examples

Finding Z-Scores

Example for variable
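
A z-score expresses each value as its distance from the mean in units of standard deviation. A minimal sketch on made-up numbers:

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z = (x - mean) / standard deviation.
# ddof=0 uses the population standard deviation; pandas defaults to ddof=1.
z = (values - values.mean()) / values.std(ddof=0)
print(z.tolist())
```

Here the mean is 5 and the population standard deviation is 2, so the value 9 has a z-score of 2.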

Dataset used: Titanic (https://www.kaggle.com/c/titanic)

This dataset contains information about the passengers on the Titanic. Attributes such as age, sex, and class are recorded for each passenger, and the label 'Survived' indicates whether the passenger survived.

Columns

Survived: Outcome of survival (0 = No; 1 = Yes)
Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
Name: Name of passenger
Sex: Sex of the passenger
Age: Age of the passenger (Some entries contain NaN)
SibSp: Number of siblings and spouses of the passenger aboard
Parch: Number of parents and children of the passenger aboard
Ticket: Ticket number of the passenger
Fare: Fare paid by the passenger
Cabin: Cabin number of the passenger (Some entries contain NaN)
Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Numerical Variables

  1. Age
  2. SibSp
  3. Parch
  4. Fare

Age and Fare are continuous variables.

Parch and SibSp are discrete variables.

Categorical Variables

  1. Name
  2. Cabin
  3. Sex
  4. Embarked
  5. Survived
  6. Pclass

Nominal Variables

Ordinal Variables

Univariate analysis

Dataset: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

Calculate Skewness
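
Skewness measures how lopsided a distribution is: roughly zero for symmetric data, positive for a long right tail, negative for a long left tail. A small sketch using `Series.skew()` on made-up values rather than the Olympic data:

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 2, 2, 3, 20])

# Series.skew() returns the bias-corrected sample skewness.
print(symmetric.skew())      # symmetric data has zero skew
print(right_tailed.skew())   # positive: one large value stretches the right tail
```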

Bivariate examples

  1. Survived: Outcome of survival (0 = No; 1 = Yes)
  2. Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  3. Name: Name of passenger
  4. Sex: Sex of the passenger
  5. Age: Age of the passenger (Some entries contain NaN)
  6. SibSp: Number of siblings and spouses of the passenger aboard
  7. Parch: Number of parents and children of the passenger aboard
  8. Ticket: Ticket number of the passenger
  9. Fare: Fare paid by the passenger
  10. Cabin: Cabin number of the passenger (Some entries contain NaN)
  11. Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Add a new column - Family size

I will add a new column, 'Family Size', computed as SibSp + Parch + 1: the passenger's siblings, spouses, parents, and children aboard, plus the passenger.
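
A sketch of that calculation on a few invented rows that use the Titanic column names `SibSp` and `Parch`:

```python
import pandas as pd

# Toy rows, not the real Titanic data.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Siblings/spouses + parents/children + the passenger.
df["Family Size"] = df["SibSp"] + df["Parch"] + 1
print(df)
```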

Add a new column - Age Group
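
One way to derive an age group is to bin the Age column with `pd.cut`. The bin edges and labels below are assumptions chosen for illustration; the original notebook's grouping is not shown here.

```python
import numpy as np
import pandas as pd

# Toy ages, including one missing value.
df = pd.DataFrame({"Age": [4, 15, 29, 47, 71, np.nan]})

# Assumed bins: (0, 12], (12, 18], (18, 60], (60, 120]
bins = [0, 12, 18, 60, 120]
labels = ["Child", "Teen", "Adult", "Senior"]
df["Age Group"] = pd.cut(df["Age"], bins=bins, labels=labels)
print(df)
```

A missing age stays missing in the new column, so it still has to be handled separately.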

Multivariate

Dataset: https://www.kaggle.com/abcsds/pokemon

Dealing with Missing values

Store Legendary Pokemon Separately

Pokemon count by Type 1

Pokemon count by Type 2
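
The steps above can be sketched on a handful of rows. The column names `Type 1`, `Type 2`, and `Legendary` follow the Kaggle CSV; filling a missing second type with the string "None" is one possible treatment and an assumption here, not necessarily what the original notebook did.

```python
import numpy as np
import pandas as pd

# Toy subset, typed in by hand rather than loaded from the CSV.
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Mewtwo", "Charizard"],
    "Type 1": ["Grass", "Fire", "Water", "Psychic", "Fire"],
    "Type 2": ["Poison", np.nan, np.nan, np.nan, "Flying"],
    "Legendary": [False, False, False, True, False],
})

# Assumed treatment: label a missing second type explicitly.
pokemon["Type 2"] = pokemon["Type 2"].fillna("None")

legendary = pokemon[pokemon["Legendary"]]    # legendary rows only
regular = pokemon[~pokemon["Legendary"]]     # everything else

type1_counts = pokemon["Type 1"].value_counts()
type2_counts = pokemon["Type 2"].value_counts()
print(type1_counts)
print(type2_counts)
```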

We will use the Bokeh library to draw interactive plots.

Code modified from: https://bokeh.pydata.org/en/latest/docs/user_guide/interaction/legends.html

Red Wine Data Analysis

Download link: https://archive.ics.uci.edu/ml/datasets/wine+quality

Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Questions that we can try to answer:

Descriptive Statistics

Since there are no null entries, we don't need to deal with missing values.

Descriptive Statistics

Analysis over Red Wine

Let's first check the Quality Column

Let's check which of the other columns are highly correlated with Quality
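
A sketch of the call pattern for ranking columns by their correlation with quality. The values below are made up purely to demonstrate the code and say nothing about the real wine data.

```python
import pandas as pd

# Invented values; only the column names mirror the wine CSV.
wine = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "volatile acidity": [0.70, 0.88, 0.60, 0.40, 0.35, 0.30],
    "quality": [5, 5, 6, 6, 7, 8],
})

# Pearson correlation of every column with quality, strongest first.
corr_with_quality = (
    wine.corr()["quality"]
    .drop("quality")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_quality)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom.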

Alcohol content is positively skewed

Let's see how alcohol varies with respect to quality

Let's analyze sulphates and quality

Let's move on to fixed acidity, volatile acidity and citric acid

Trends between other columns

Create a new column: Total Acidity
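
A possible construction of this column, assuming total acidity is defined as the sum of the fixed acidity, volatile acidity, and citric acid columns. That definition is an assumption made for this sketch, and the rows are toy values.

```python
import pandas as pd

# Toy rows using column names from the red wine CSV.
wine = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "volatile acidity": [0.70, 0.88, 0.28],
    "citric acid": [0.00, 0.00, 0.56],
})

# Assumed definition: row-wise sum of the three acidity-related columns.
acid_cols = ["fixed acidity", "volatile acidity", "citric acid"]
wine["total acidity"] = wine[acid_cols].sum(axis=1)
print(wine)
```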

Alauddin Sabari

alauddinsabari@gmail.com

alauddin.me